Abstract

A method for extracting French verbal chunks, both inflected and infinitive, is explored and tested on
real corpora. The method uses declarative morphological and local grammar rules that specify chunks and
some simple contextual structures, relying on limited lexical information and on a few simple
heuristic/statistical properties obtained from restricted corpora. We present the specific goals, the
architecture and the formalism of the system, the linguistic information on which it relies, and the
results obtained on real corpora.

1 Introduction 

It is generally acknowledged that taggers and chunkers, whether statistical or grounded in linguistic knowledge,
will never reach perfect results 1. However, it is also generally acknowledged that there are cases where, using
only local contextual part-of-speech information, and chiefly information coming from morphological
items, deductions about categories or chunks can be made with near certainty 2.

In this situation, the crucial question is: what results can be obtained with a given kind of information?
Or, conversely, what information does one need to obtain a particular result? We try to address
these issues here by presenting a system whose immediate target is the identification of verbal
chunks, finite and infinitive, in French.

We readily agree that this target is fairly limited compared to that of general-purpose
taggers and chunkers. However, a common way to dig into a problem is to split it into smaller
ones: our goal is to identify as clearly as possible the possibilities and limits of different types of
information with respect to specific structures, here verbal chunks 3.

Given this issue, the general and strong requirement underlying the specification of our verbal
chunk extractor is: do the most with the least. That is, we want to reduce as far as possible the resources

*Thanks to Zulema Solana for her comments on a previous version of the paper, the responsibility for which is entirely ours.

1 Abney [1] gives the following example: "to correctly disambiguate help in give John help_N versus let John help_V, one
apparently needs to parse the sentences, making reference to the differing subcategorization frames of give and let."

2 Tapanainen and Voutilainen [11] discuss this point and give the following example: "we can safely exclude a finite-verb
reading if the previous word is an unambiguous determiner."

3 Our system thus belongs to partial parsing (cf. Manning & Schütze [8, p. 375]), but with verbal chunks as the target of the
analysis rather than the more commonly targeted noun phrases. The reason for our choice is the potential for inferences about the
syntactic analysis of the whole sentence that can be drawn from analyzed verbal chunks; on the importance of correctly tagging
verbs, see Chanod & Tapanainen [5] and footnote 6.

necessary to put the system into use, whether lexical information or tagged corpora, while keeping in mind
that this must not hamper final results on real texts. Our general underlying hypothesis is: with (i)
a simple linguistic model, chiefly relying on morphological expressions in verbal, nominal and
prepositional chunks, plus (ii) very limited lexical resources, plus (iii) heuristic statistical information
on forms not covered by (i) and (ii), it is possible to obtain results standardly considered
good, i.e. around 98% recall and precision 4.

As a consequence, the system is centrally rule-based, but statistically completed. The general
requirement of resource poverty settles the balance between the two types of contribution.
We use a simple linguistic model as much as possible, which dispenses with large tagged
corpora, and we complete it with very simple statistical information. People more or less agree on the
respective advantages and disadvantages of corpus statistics vs. linguistic knowledge (see Chanod &
Tapanainen [5] and Brill [4], which represent opposite orientations), and both approaches can or must be
put to work together (ordinarily, they are), the most important difference lying in the weight given
to each one.

In general terms, our system can be characterized as working in the opposite way to the transformation-based
learning approach to tagging (on this, see Brill [4], Charniak [6], Manning & Schütze [8]) 5.

Transformation-based tagging is organized in two steps. In the first step, a simple
algorithm records, for each word, its most common part of speech in the training corpus. The tagging of
new texts then consists in simply assigning its most common tag to each word. The reported accuracy at this
level is 90%. In the second step, rules are proposed and evaluated in order to change the tags
obtained in the first step into others intended to be more accurate. For example, related to our issue, a rule can be:

Change a verbal tag into a nominal tag if the previous word is a non ambiguous determiner. 

Transformations are not specified in terms of linguistic knowledge. Given |PoS|, the total number
of parts of speech, there are |PoS|^3 possible transformational rules of the previous type. The necessary
training data, i.e. a tagged corpus of several hundred thousand words, are then put to use. The system
tries each possible rule on the training data, measuring the accuracy the rule induces. Charniak
reports:

Some rules make things worse, but many make things better, and we pick the rule that makes 
the accuracy the highest. Call this rule 1.

Next we ask, given we have already applied rule 1, which of the other rules does the best job 
of correcting the remaining mistakes. This is rule 2. We keep doing this until the rules have 
little effect. The result is an ordered list of rules, typically numbering 200 or so, to apply to 
new examples we want tagged. 

In our system, structural rules (see section 5 below) express the simple linguistic model that is needed.
They are the counterparts of the transformational rules for changing tags, but they are not learned
by an algorithm and they are not used for erasing or modifying results already obtained in
a previous step. Our heuristic/statistical information is analogous to the statistical information calculated
in the first step of the transformation-based learning strategy, with the big differences that it does not
concern all words related to verbal chunks, but only those whose tags are not assigned by structural
rules, and that our "training" corpus is extraordinarily small compared with the training corpora used in
statistical approaches.

4 Manning & Schütze [8, p. 371] report a range of 95% to 97% for accuracy calculated over all words; Charniak [6]
reports 97%; Habert & al. [7] recall the standard figures of 95% to 98%. Reported figures for French are 96.8% for a statistical
tagger and 98.7% for a constraint-based one with heuristic rules (see Chanod & Tapanainen [5]). In general, it is difficult to give
clearly meaningful figures, given the different parameters involved in tagging tasks.

5 The system is thus, in spirit if not necessarily in expressive or covering power, in the line of Chanod & Tapanainen [3], the
EngCG tagger [13], Tapanainen & Voutilainen [11], Tapanainen & Järvinen [10].

In the following, section 2 specifies the targeted task of the system, section 3 presents the system
architecture, section 4 describes linguistic resources and section 5 illustrates rules of the system. Section 6
is devoted to the analysis of results obtained by a detailed evaluation: we are interested not only in
global figures but also in more detailed ones, which allow us to discriminate between the different
types of parameters that contribute to global results. In particular, we want to distinguish between
results obtained thanks to the linguistic model and results obtained thanks to heuristic/statistical
choices, to distinguish between missing resources and real counter-examples, and also to
make sure that "good" results do not simply come from a high or dominant percentage of unambiguous
forms 6. The perspectives of the system are discussed in the final section.

2 Task Definition 

In French, verbal chunks may include, in addition to a verb nucleus:

• auxiliary verbs, of two types, as in Jean a mangé or in Jean est reparti,

• clitic pronouns, as in Jean le lui prend,

• negation particles, as in Jean ne parle pas,

• adverbs, as in Jean a rapidement mangé,

• occasionally, incidental phrases or clauses, as in Jean a, par ailleurs, mangé,

• and, in infinitive verbal chunks, a preposition, as in Il parle de le prendre.

With the sole exception of the negative particle ne, and of some reflexive forms, all the morpheme
expressions in verbal chunks are ambiguous: le can be an article or a clitic pronoun, lui a nominative or
a dative clitic pronoun, en a preposition or a clitic pronoun, pas a particle in negation constructions or
a noun, etc. Furthermore, many verbal expressions are ambiguous: note can be a verb or a noun, avoir
a noun or the infinitive of one of the two auxiliary forms, savoir a noun or an infinitive, entre a verb
or a preposition, inverse a verb or an adjective, lui the past participle of the verb luire or the clitic or
the nominative pronoun, soit a verb or a conjunction, etc. Besides this, verbal forms used as auxiliaries
can also be used as main verbs, as in Jean est triste (John is sad) or Marie a la migraine (Mary has got
a headache). But, despite this abundance of different kinds of ambiguity, for a human
reader, real texts are never ambiguous with respect to verbal chunks.

One central challenge of verbal chunk extraction is thus to resolve morphological and/or lexical and/or
functional ambiguities. But we do not think that disambiguating items in the whole sentence or even
in the targeted structures is a step which must necessarily be completed before chunks are obtained.
In any case, we want to find out how far one can go with different kinds of limited information while
combining disambiguation with chunk extraction 7.

6 Manning & Schütze [8, pp. 374-375] point out that "some authors give accuracy for ambiguous words only in which the accuracy
figures are of course below". In the literature reporting results, it is not always possible to find separate results for ambiguous
and unambiguous forms. Even in Chanod & Tapanainen [5], where some figures are indeed separated, those on the overall
performance of the reported system are not. The point is important from a methodological viewpoint; see Habert & al. [7], who
further point out that tagging errors have quite different consequences for the syntactic analysis of the whole sentence depending
on the kind of ambiguity: a tagging error on the noun/verb ambiguity will hamper the results on the syntactic
sentence structure more significantly than a tagging error on the past participle/adjective ambiguity.

7 Manning & Schütze [8, pp. 374-375] observe that, although the interest in tagging comes from the belief that many engineering
applications of language processing must benefit from syntactically disambiguated texts, "it is surprising that there seem to be
more papers on stand-alone tagging than to applying tagging to a task of immediate interest". Our system aims to address this
criticism by combining tagging with chunk extraction.

3 System Architecture



The system consists of two modules: Smorph, which performs tokenization and morphological analysis
(Aït-Mokhtar [2]), and Pasmo (Paulo et al. [9]), which, given the output of Smorph, delimits sentences
and, in the present case, verbal chunks, based on "recomposition rules" similar to rewriting rules. Smorph
and Pasmo are connected by a script interface which tags the words unknown to Smorph. In addition, the system is
associated with evaluation software (Trouilleux [12]) which can provide detailed evaluation results (cf.
section 6) 8.

Given a text file, Smorph outputs a list of tokens of the form: 

surface_form [lemma, FVL]

where FVL is a feature/value list. 

A Pasmo recomposition rule is of the form: 

token_1 ... token_n -> token'_1 ... token'_m

with the reading "a sequence of tokens matching the left-hand side of the rule is to be recomposed into
the sequence of tokens specified in the right-hand side of the rule" (i.e. basically, the reading is the
inverse of that of traditional rewriting rules). The Kleene + operator may be used in the left-hand side of
the rule. In the new tokens produced by a rule, the surface_form and lemma elements may either
be specified as specific strings of characters or as the concatenation of the surface_form and lemma
elements of the input tokens, while the feature/value pairs are to be specified as strings of characters only.
Examples of rules are given in section 5.

The data structure output by application of Pasmo rules is the same as that output by Smorph, al- 
lowing cyclic application: the rules are first applied to the output of Smorph, producing output 1, then 
applied again using output 1 as an input, producing output 2, then applied again, . . . until the input is such 
that no rule applies. 
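The cyclic regime can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pasmo's implementation: the token layout (surface form, lemma, tuple of values), the names `apply_once`, `apply_cyclically` and `cl_cl_v`, the left-to-right first-match strategy and the `max_cycles` guard are all hypothetical choices made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Assumed token layout: (surface_form, lemma, values). Not Smorph's actual format.
Token = Tuple[str, str, Tuple[str, ...]]
# A rule looks at the tokens from position i and returns
# (number of tokens consumed, replacement tokens), or None if it does not match.
Rule = Callable[[List[Token], int], Optional[Tuple[int, List[Token]]]]


def apply_once(tokens: List[Token], rules: List[Rule]) -> Tuple[List[Token], bool]:
    """One left-to-right pass; returns the new token list and whether a rule fired."""
    out: List[Token] = []
    changed = False
    i = 0
    while i < len(tokens):
        for rule in rules:
            result = rule(tokens, i)
            if result is not None:
                consumed, replacement = result
                out.extend(replacement)
                i += consumed
                changed = True
                break
        else:
            # No rule matched at this position: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out, changed


def apply_cyclically(tokens: List[Token], rules: List[Rule], max_cycles: int = 50) -> List[Token]:
    """Feed each output back in as input until no rule applies (or the guard is hit)."""
    for _ in range(max_cycles):
        tokens, changed = apply_once(tokens, rules)
        if not changed:
            break
    return tokens


def cl_cl_v(tokens: List[Token], i: int) -> Optional[Tuple[int, List[Token]]]:
    """Hypothetical rule: clitic + clitic + verb is recomposed into one chunk token."""
    window = tokens[i:i + 3]
    if len(window) == 3 and [t[2][0] for t in window] == ["cl", "cl", "v"]:
        surface = " ".join(t[0] for t in window)
        lemma = " ".join(t[1] for t in window)
        return 3, [(surface, lemma, ("vnfl",))]
    return None


sentence = [
    ("Jacques", "Jacques", ("auf",)),
    ("le", "le", ("cl",)),
    ("lui", "lui", ("cl",)),
    ("donne", "donner", ("v",)),
]
result = apply_cyclically(sentence, [cl_cl_v])
print(result)
```

Because the output tokens have the same shape as the input tokens, the second cycle simply finds nothing left to recompose and the loop stops. The guard on the number of cycles is an addition of this sketch, protecting against a rule that would match its own output.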

The purpose of this paper is not so much to introduce a new chunking system as to analyse
the importance of different types of information in a chunking process related to verbal forms. We thus
won't go into a detailed comparison of Smorph-Pasmo with existing systems.

4 Lexical Linguistic Information 

The declarative source of Smorph is a lexicon in which the linguist declares lemmas and the set of
feature/value pairs associated with each of them in terms of their inflected forms. Smorph can also express
compound words, whether hyphenated, as in remue-ménage (commotion), or not, as in pomme de terre
(potato).

Contrary to general practice, the basic choices with respect to our lexicon are:

• With the exception of verbal forms which are ambiguous between fl (inflected) and pp (past participle),
ambiguity is not expressed by the multiplication of the same forms. E.g. there are not two forms le,
one associated with a value art[icle] and the other with a value cl[itic], nor two forms note,
one associated with n[oun] and the other with v[erb]; instead there is a unique form le associated
with the value ocl of the feature AMB[iguity], which expresses the potential ambiguity of the
clitic, and a unique form note associated with the value ov of the same feature, which expresses the
potential ambiguity of the verb.

• With the sole exception of first and second person plural verb forms, no inflectional values are
associated with morphological or lexical forms 9.

8 Thanks to Mourad Sahi for his contribution with a program to compile results obtained on a set of files. 
9 The value is needed in order to distinguish contextually a subject nous, as in nous voulons (we want), from a clitic one, as in Il
nous parle (He talks to us).

• Lexical categories are drastically reduced. There is no noun category, and there are fewer than ten
adjective entries.

In its present state, as far as expressions occurring inside verbal chunks are concerned, our lexicon
contains the following forms:

1. all the verbal forms of our "training" and test corpus 10,

2. all the morphological forms which can appear in verbal chunks: clitic pronouns, ne, pas, prepositions
in infinitive chunks (à, de, etc.),

3. a subset of a restricted class of adverbs, such as ainsi, d'ailleurs, alors,

4. a subset of multiword frozen adverbial expressions which incorporate forms which, in isolation,
can be verbs (e.g. sans doute, n'importe qui, where doute and importe alone can be verbs).

Items (1) and (2) are intended to be exhaustive, while this is not the case for items (3) and (4). Verbal
forms are categorized as follows:

• inflected forms fl are categorized either as ov (potentially ambiguous forms), such as note, or
as nv (not potentially ambiguous forms), such as accepte; infinitive forms inf are categorized either
as oin (potentially ambiguous forms), such as devenir, or as nin (not potentially ambiguous
forms), such as assurer,

• a subset of the potentially ambiguous inflected forms (category ov) is further categorized as having
a statistically unlikely verb reading (category jv, j[amais] v[erbe], never verb), such as
empire,

• past participles, whether they are ambiguous with respect to nouns (e.g. entrée) or not (e.g. dormi),
belong to a unique category pp,

• verbal forms which are ambiguous between fl (inflected) and pp (past participle), e.g. fait, are associated
in the lexicon with two different forms, one categorized as fl and the other as pp.

In addition to this categorization, as stated above, first and second person verbal forms (and only
those) are associated with inflectional information in the form of a person and number feature.

The lexicon contains 1,731 verbal forms in all: 658 fl, 595 pp, 478 inf. With respect to
ambiguity, the significant figures are: 429 (about 2/3) of the fl forms are nv while 229 are ov; 440 (92%) of the
inf forms are nin while 38 are oin 11.

In addition to forms occurring in verbal chunks, the lexicon also contains some prepositions,
determiners which are not ambiguous with forms inside verbal chunks, such as un, des, ce..., unambiguous
possessives, such as mon, ses..., a small set of adjectives (fewer than ten), a small set of pronouns, such as il, on...,
and punctuation marks. The forms other than punctuation marks which are not verb forms and which
can appear outside verbal chunks (this is not the case, e.g., of nominative or some reflexive clitics such as -il,
me, se) are all categorized as auf (au[tre] f[orme], other form). There are 219 auf forms in all
in the lexicon.

With such limited lexical resources, it is obvious that not all token forms in texts will be associated by
Smorph with some feature/value categorization. In fact, in both the "training" and test corpus, a little more
than 1/3 of the forms are not recognized by Smorph and are associated by the script interface with [au,
auf] feature values (au being a feature value not used in the lexicon). Overall, 2/3 of the tokens are
associated with the most frequent feature value, auf.

10 In order to examine the minimal lexical information (in terms of categories) required for our proposed task, before running the
system on a given corpus, we make sure that every verbal and potentially verbal form in the corpus has an appropriate
entry in the lexicon. In other words, lexical coverage is not evaluated here. We intend to evaluate hypotheses, not resources.

11 There were no examples of error expressions in the results concerning ambiguous pp forms.

The following is the Smorph output for Il la note:

'Il'.
['il', 'TPRO','pnom', 'TPASS','auf'].
'la'.
['la', 'TFG','cl', 'AMB','ocl', 'TPASS','auf'].
'note'.
['noter', 'TFV','v', 'MOD','fl', 'AMB','ov'].

where, besides the values auf, fl and ov already commented on, the values pnom, cl, v and ocl stand for
nominative pronoun, clitic, verb and ambiguous clitic, respectively.
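As a rough illustration of how such analyses might be consumed downstream, the sketch below mirrors the three entries above in a small Python dictionary and adds the [au, auf] fallback that the script interface assigns to forms Smorph does not recognize. Everything here is an assumption made for the example: the names `LEXICON`, `UNKNOWN` and `analyse`, the reduction of feature/value pairs to bare values, and the choice of the surface form as the lemma of an unknown word.

```python
# Hypothetical mirror of three lexicon entries, keyed by lowercased surface form.
# Feature/value pairs are reduced to their values, as in the rule notation.
LEXICON = {
    "il": ("il", frozenset({"pnom", "auf"})),
    "la": ("la", frozenset({"cl", "ocl", "auf"})),
    "note": ("noter", frozenset({"v", "fl", "ov"})),
}

# Values given to forms that are absent from the lexicon.
UNKNOWN = frozenset({"au", "auf"})


def analyse(surface):
    """Return (surface_form, lemma, values); unknown forms get the fallback values."""
    lemma, values = LEXICON.get(surface.lower(), (surface, UNKNOWN))
    return surface, lemma, values


for word in ["Il", "la", "note", "migraine"]:
    print(analyse(word))
```

With this layout, a single entry such as la carries both its clitic reading and the marker of its potential ambiguity (ocl), instead of being listed twice.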



5 Rules 

Four basic types of rules are expressed in the same overall formalism: 

1. Structural disambiguation rules for potentially ambiguous verbal forms (ov) which are actually not
in chunks.

2. Rules for the construction of verbal chunks 

(a) not potentially ambiguous, 

(b) potentially ambiguous. 

3. Rules for the disambiguation of potentially ambiguous chunks. 

4. Rules for the disambiguation of forms not disambiguated by (1) to (3). 

The rules are illustrated in the following in a semi-formal notation, where capital letters stand for
variables and feature/value pairs are reduced to a minimum, i.e. a value. Table 1 gives the key to the
symbols used in the examples.

Rules (1) concentrate on prepositional and nominal phrases, and on the initial sentence position. 

As an example, the rule 

X[det] Y[ov] -> X[det] Y[not-v]
specifies that an ambiguous verbal form following a determiner is not a verb (e.g. ce juge). 



Lexical forms
  cl      clitic
  ocl     ambiguous clitic
  v       verb
  ov      ambiguous verb
  nv      not ambiguous verb
  det     determiner
  prep    preposition
  pnom    nominative pronoun

Chunks
  vnfl    not ambiguous verbal inflected chunk
  vnfla   ambiguous verbal inflected chunk

Other symbols
  not-    negation prefix
  +       concatenation operator

Table 1: Key to symbols used in the example rules.

Rules (1) are not intended to be exhaustive, but have been reduced to what was considered to be
a strict and sufficient minimum for reaching the proposed target. Thus, many
potentially ambiguous verb forms (ov) clearly remain: their potential ambiguity is not resolved.

Rules (2a) are specified around unambiguous verbs (nv) and also around potentially ambiguous
ones (ov), inasmuch as these are contextually disambiguated internally in the verbal chunk by the presence
of an unambiguous form (e.g. ne).

As an example, the rule 

X[cl] Y[cl] Z[v] -> X+Y+Z[vnfl]

disambiguates le, lui and donne and constructs the chunk in Jacques [le lui donne]vnfl.

Rules (2b) specify potentially ambiguous verbal chunks containing at least two forms. This situation
typically arises when chunks are formed by a clitic pronoun and a verb, both of which are ambiguous.
As an example, the rule

X[ocl] Y[ov] -> X+Y[vnfla]

specifies a provisional ambiguous chunk for la note or en charge.

A challenge for rules (2) is the identification of incidental phrases. Two basic types of incidental
phrases were foreseen, one between commas and the other without, as in

• Il a, très souvent, été invité.

• Il a très souvent été invité.

They are respectively described by the two following structures, which make use of the Kleene + operator:

, auf+ ,
auf+

These straightforwardly say: incidental phrases in verbal chunks consist of one or more auf
expressions, which can be preceded and followed by commas. Depending on the verbal chunk, the
incidental phrase can be inserted at different points. Following the general strategy, not all possible
incidental phrases were defined in our rules, but the general hypothesis is that the same structures are
candidates to express all of them.
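The two structures can be read as a small matcher over a sequence of tags. The Python sketch below is one possible reading, given only as an illustration: it assumes that each token has been reduced to a single tag, that commas carry the tag ",", and that the adverbs of the examples are tagged auf; the function name `match_incidental` is hypothetical.

```python
def match_incidental(tags, start):
    """Return the number of tokens covered by an incidental phrase at `start`, or 0.

    Two shapes are accepted, following the structures ", auf+ ," and "auf+".
    """
    i = start
    opened_by_comma = i < len(tags) and tags[i] == ","
    if opened_by_comma:
        i += 1
    run_start = i
    while i < len(tags) and tags[i] == "auf":
        i += 1
    if i == run_start:
        # The Kleene + requires at least one auf token.
        return 0
    if opened_by_comma:
        # An opening comma must be balanced by a closing one.
        if i < len(tags) and tags[i] == ",":
            return i + 1 - start
        return 0
    return i - start


# Il a , très souvent , été invité (tags assumed for the example)
with_commas = ["pnom", "v", ",", "auf", "auf", ",", "pp", "pp"]
# Il a très souvent été invité
without_commas = ["pnom", "v", "auf", "auf", "pp", "pp"]

print(match_incidental(with_commas, 2))
print(match_incidental(without_commas, 2))
```

In a chunk-building rule, such a matcher would be tried at each point where an incidental phrase may be inserted, for instance between the auxiliary and the past participle.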

Rules (2a) and (2b) are intended to be exhaustive within their respective scopes: Rules (2a) must
build all unambiguous verbal chunks and Rules (2b) all ambiguous verbal ones with two forms. After
application of these rules, some verbal forms (the ov ones) and verbal chunks (the ones built by Rules
(2b)) thus remain to be disambiguated. This is the job of Rules (3) and Rules (4).

Rules (3) take advantage of the cyclic application of rules. After cycle 1, the vnfl chunks already specified
can be used to disambiguate the vnfla chunks and ov verb forms already specified.

As examples, the rules 

X[vnfl] Y[vnfla] -> X[vnfl] Y[not-vnfl]
X[vnfl] Y[ov] -> X[vnfl] Y[not-vnfl]

will respectively be in charge of the resolution of en charge in Il [n'est pas]vnfl en charge d'une solution,
where en charge is not a verbal chunk, and of compte in Il [tient]vnfl compte d'une solution, where
compte is not a verbal form. By the same token but with opposite effects, the rules

X[pnom] Y[vnfla] -> X[pnom] Y[vnfl]
X[pnom] Y[ov] -> X[pnom] Y[vnfl]

will respectively assign vnfl to la juge in Il la juge and to compte in Il compte venir.
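These four example rules amount to a lookup on the pair (tag of the left neighbour, tag of the ambiguous item). The following Python sketch shows that reading; it is an illustration only, and the names `CONTEXT_RULES` and `disambiguate`, the (surface, tag) token layout and the single left-to-right pass are assumptions of the example rather than features of Pasmo.

```python
# (tag of left neighbour, tag of current item) -> new tag for the current item
CONTEXT_RULES = {
    ("vnfl", "vnfla"): "not-vnfl",
    ("vnfl", "ov"): "not-vnfl",
    ("pnom", "vnfla"): "vnfl",
    ("pnom", "ov"): "vnfl",
}


def disambiguate(tokens):
    """Apply the context rules in one left-to-right pass over (surface, tag) pairs."""
    out = list(tokens)
    for i in range(1, len(out)):
        new_tag = CONTEXT_RULES.get((out[i - 1][1], out[i][1]))
        if new_tag is not None:
            out[i] = (out[i][0], new_tag)
    return out


# Chunks are assumed to have been built already by Rules (2).
print(disambiguate([("Il", "pnom"), ("n'est pas", "vnfl"), ("en charge", "vnfla")]))
print(disambiguate([("Il", "pnom"), ("la juge", "vnfla")]))
print(disambiguate([("Il", "pnom"), ("compte", "ov"), ("venir", "inf")]))
```

Since each decision reads the tag already assigned to the left neighbour, a form that has just been resolved as vnfl can in turn serve as context for the item that follows it.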
It is at this point that minimal statistics come into play. Intuitively, or using very simple heuristic techniques on
very restricted texts, it is easy to establish that, given the disambiguation performed by rules (1), ambiguous
verbal chunks containing at least two forms (vnfla) are more frequent as nominal phrases than as
verbal chunks (in particular as le, la, les are most likely to be determiners, and as en is more likely to be
a preposition), and, conversely, that ambiguous verbal forms (ov) alone, except the jv ones, are more
frequent as verbal chunks than as nouns. Rules (4) express this.

We thus operate with rules that can be called structural, i.e. Rules (1) to (3), and with the remaining
Rules (4), which are weakly motivated heuristic rules resolving the "other" cases. Our system tags
as vnfl-I and vninf-I the inflected and infinitive verbal chunks obtained by structural rules, and as
vnfl-II and vninf-II the inflected and infinitive verbal chunks obtained by the remaining rules.
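The default decisions of Rules (4) for inflected forms can be summarized as a small function. This is a sketch under assumptions, not the system's rule set: the function name `apply_rules_4` and the `is_jv` flag are hypothetical, the treatment of jv forms as non-verbs is inferred from their description as having a statistically unlikely verb reading, and infinitive forms are left out.

```python
def apply_rules_4(tag, is_jv=False):
    """Default decision for an item left ambiguous by the structural rules."""
    if tag == "vnfla":
        # Ambiguous clitic + verb sequences are more often nominal phrases.
        return "not-vnfl"
    if tag == "ov":
        # Isolated ambiguous verb forms are more often verbs, unless marked jv.
        return "not-v" if is_jv else "vnfl-II"
    # Anything already decided by structural rules is left untouched.
    return tag


print(apply_rules_4("vnfla"))
print(apply_rules_4("ov"))
print(apply_rules_4("ov", is_jv=True))
```

Keeping the -II suffix on chunks produced this way is what later allows the contribution of heuristic choices to be measured separately from that of structural rules.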

6 Results 

Our general underlying hypothesis can now be stated precisely: limited lexical resources as characterized
in Section 4, plus the linguistically motivated structural rules (Rules (1) to (3)), plus simple
statistical/heuristic rules (Rules (4)) can reach results at the targeted level (i.e. around 98% recall and
precision). The results actually obtained, which will be presented below once the background for analysing
them has been made explicit, do not falsify the hypothesis.

The obtained results must be analyzed in terms of the goals of the system on the one hand, and in
terms of the kinds of information actually implemented and/or foreseen on the other hand. We want to
distinguish missing resources, which are foreseen but not implemented, from real counter-examples
to our hypothesis. E.g. if a corpus contains an adverbial frozen expression or an incidental clause
in a verbal chunk that is foreseen but not implemented, we do not consider the errors resulting from such
non-implementation as counter-examples to the hypothesis, but as missing resources.

We know, with respect to forms inside verbal chunks, that verbal forms and morphological items are
exhaustively implemented, while adverbs and adverbial frozen expressions are not (see section 4). We
know furthermore that Rules (1) are not intended to be exhaustive, while Rules (2a) and (2b) are intended
to be so, even if not all formally foreseen incidental phrases are actually implemented. We know
finally that a subset of ambiguous verbal (ov) forms were classified on intuitive heuristic grounds as jv
(see section 4), and that the vnfla and ov forms remaining as such after application of structural rules
are classified by Rules (4) as non-v forms and vnfl forms, respectively (see section 5).

From these, it is possible to classify counter-examples as structural, i.e. related to structural rules and 
heuristic, i.e. related to Rules (4). Structural counter-examples are further classified as being produced 

(i) by incompleteness of Rules (1), 

(ii) by minor inaccuracies in the formulation of Rules (2), 

(iii) by lack of the expressive power of Rules (2). 

The limits of the system and its underlying general hypothesis are given by structural counter- 
examples produced by lack of expressive power of Rules (2) together with heuristic counter-examples. 
Intuitively, the system reaches its limits when it is not possible to modify it without modifying its general 
assumptions. 

To evaluate the system, we will use the classic recall and precision measures. In addition to these
measures, we define SL (System Limits) as:

SL = (1 - EE) * 100 

where EE is the ratio of 'error expressions' in a test corpus to the number of chunks actually observed
in the corpus (i.e. the denominator of the recall measure). An error expression is either an expression

                          possible  actual  correct  recall  precision
training      infinitive       392     392      392  100.00     100.00
              finite           819     822      814   99.39      99.03
              all             1211    1214     1206   99.59      99.34
test state 1  infinitive       285     283      282   98.95      99.65
              finite           668     677      651   97.46      96.16
              all              953     960      933   97.90      97.18

Table 2: Results on training and test corpus in state 1.






                          possible  actual  correct  recall  precision
test state 2  infinitive       285     284      284   99.65     100.00
              finite           668     673      661   98.95      98.22
              all              953     957      945   99.16      98.75

Table 3: Results on test corpus in state 2.



observed in the test corpus to which one or more recall or precision errors correspond 12, or an expression
output by the system as a verbal chunk which does not correspond, in part or in totality, to an actually
observed verbal chunk.
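The three measures can be written down directly from these definitions. The Python sketch below is given for illustration: the function names are hypothetical, `possible` stands for the number of chunks in the manually tagged corpus and `actual` for the number of chunks output by the system. The recall and precision calls use the "all" row of Table 3, whereas the arguments passed to `system_limits` are invented numbers chosen only to show the behaviour of the formula.

```python
def recall(correct, possible):
    """Percentage of observed chunks that the system identified correctly."""
    return 100.0 * correct / possible


def precision(correct, actual):
    """Percentage of chunks output by the system that are correct."""
    return 100.0 * correct / actual


def system_limits(error_expressions, possible):
    """SL = (1 - EE) * 100, with EE = error expressions / observed chunks."""
    return (1 - error_expressions / possible) * 100


# "all" row of Table 3: possible = 953, actual = 957, correct = 945
print(round(recall(945, 953), 2))
print(round(precision(945, 957), 2))

# Invented figures: 10 error expressions for 500 observed chunks,
# then more error expressions than observed chunks (negative SL).
print(round(system_limits(10, 500), 2))
print(system_limits(3, 2))
```

Counting error expressions rather than individual errors means that a single observed chunk giving rise to one recall error and two precision errors weighs once in SL, not three times.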

Given this background, it is now possible to report and analyse the actual results of our system.

We used a 13,000 word corpus of finance news articles to tune our system during development. This
corpus will be referred to as the "training" corpus. We emphasize that the "training" corpus is not a
previously tagged one. It is used neither to extract statistical information on n-grams with n > 2 nor
to learn rules from it. The structural rules and the lexical information needed were foreseen as
working hypotheses. The contribution of the so-called "training" corpus is thus limited to the tuning of
the system, to the verification of heuristic choices related to Rules (4) inasmuch as they concern ov and jv
forms, and to the detection of adverbs and adverbial frozen expressions incorporated into verbal chunks.
Table 2 gives the results on the "training" corpus.

The system underlying the results on the "training" corpus was applied to a 10,400 word, previously
unseen, "test" corpus from the same source. After this application, we detected 13 adverbial expressions
which were missing in incidental clauses and added them to the lexical resources. We refer to the state of the
system at this stage as "state 1". The system in state 1 was applied to both the "training" and the test
corpora. Results are given in Table 2. The possible and actual columns give the number of verbal chunks
in the manually tagged corpus and in the output of the system, respectively.

Analysis of results on the test corpus reveals two missing (but formally foreseen) Rules (2) for
inflected verbal chunks with incidental clauses, and four minor inaccuracies in the formulation of Rules
(2) (two related to inflected chunks and two to infinitive ones). The two missing rules for inflected verbal
chunks were added and the four verbal chunk rules modified. The system in state 2 was thus obtained.
These modifications allowed the correct identification of 10 more finite verbal chunks and of 2 more
infinitive verbal chunks (see Table 3). On the "training" corpus, the system in state 2 produces the same
results as in state 1, except for one additional error.

12 In the sentence, Aucun groupe etranger n'a pour le moment fait connaitre son interet, one observes the finite verbal chunk n'a 
pour le moment fait. Given this sentence, the system actually identifies two finite verbal chunks n 'a and fait. We here have one 
recall error (the observed chunk is not identified) and two precision errors (the system identifies two chunks which are not observed 
as such in corpus), but only one error expression: n 'a pour le moment fait. The recall and precision errors are correlated and this is 
taken into account in the computation of SL. Note that SL ranges from 100 (perfect result) to -co. Negative values are obtained 
when the system produces more error expressions than there are observed expressions in the corpus. 
label        correct    actual    precision (%) 

vninf-I          273       273           100.00 
vninf-II          11        11           100.00 
vnfl-I           615       618            99.51 
vnfl-II           46        55            83.64 

Table 4: Precision figures for different rule types. 
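The precision column of Table 4 follows directly from the two count columns. The short Python sketch below recomputes it, assuming that precision is simply the number of correct chunks divided by the number of chunks output by the system, expressed as a percentage; the names TABLE_4 and precision are ours and purely illustrative.

```python
# Counts copied from Table 4 (test corpus): label -> (correct, actual).
TABLE_4 = {
    "vninf-I": (273, 273),
    "vninf-II": (11, 11),
    "vnfl-I": (615, 618),
    "vnfl-II": (46, 55),
}


def precision(correct: int, actual: int) -> float:
    """Percentage of the chunks output by the system that are correct."""
    return 100.0 * correct / actual


# Recompute the last column of the table, rounded to two decimals.
precisions = {
    label: round(precision(correct, actual), 2)
    for label, (correct, actual) in TABLE_4.items()
}

for label, value in precisions.items():
    print(f"{label:<9} {value:6.2f}")
```

The recomputed values agree with the table: the structurally defined tags and the heuristic infinitive tag stay at or near 100, while the heuristic finite tag vnfl-II is noticeably lower.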



The analysis of error expressions in the results of applying the system in state 2 to the test 
corpus allows us to obtain a first estimate of SL and of its underlying general hypothesis. 

On the test corpus, 16 error expressions remain. They are associated with 20 recall or precision errors 
made by the system: 

• 2 error expressions concern both recall and precision (i.e. 4 overall errors), 

• 1 error expression 13 concerns 1 recall error and 2 precision errors (i.e. 3 overall errors), 

• 8 error expressions each concern a precision error only, 

• and 5 error expressions each concern a recall error only. 
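The bookkeeping in the list above can be checked mechanically. The following Python sketch assumes that an error expression said to concern "both recall and precision" carries exactly one error of each kind, which is what the stated totals imply; the name BREAKDOWN and the tuple layout are illustrative choices of ours.

```python
# Each entry: (number of error expressions,
#              recall errors per expression,
#              precision errors per expression).
BREAKDOWN = [
    (2, 1, 1),  # expressions concerning both recall and precision
    (1, 1, 2),  # the expression with 1 recall and 2 precision errors
    (8, 0, 1),  # precision-only expressions
    (5, 1, 0),  # recall-only expressions
]

error_expressions = sum(n for n, _, _ in BREAKDOWN)
recall_errors = sum(n * r for n, r, _ in BREAKDOWN)
precision_errors = sum(n * p for n, _, p in BREAKDOWN)
total_errors = recall_errors + precision_errors

print(error_expressions, recall_errors, precision_errors, total_errors)
```

Under this assumption the four categories add up to the 16 error expressions and 20 individual errors reported for the test corpus.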

Out of the 8 error expressions concerning precision errors only, 3 arise from the incompleteness of 
Rules (1). They affect finite verbal chunks, but the rules can easily be completed within the expressive 
power of the system. Thus they can be considered as counter-examples not affecting SL. 

All of the other error expressions (i.e. 13) lead to mistaken results which are true counter-examples 
and count in the calculation of SL. 

The 3 error expressions affecting both precision and recall are produced by a lack of expressivity 
of Rules (2), and as such they are true counter-examples. They all involve finite verbal chunks with 
incidental clauses which are not marked by commas, as in L'encours ...[a pour sa part plus que doublé]. 
The system cannot currently deal with such cases because part, being a potential verbal form, is not an 
auf (see section 4). As listed above, these 3 error expressions generate 7 recall or precision errors, 
all of them on finite verbal chunks. 

The 10 remaining error expressions which are true counter-examples are all produced by 
statistical/heuristic failures. Only 1 of them concerns an infinitive verbal chunk (an error affecting 
recall); 4 affect the recall on finite verbal chunks and 5 the precision. 

Thus we have the overall ratio of counter-example error expressions EE ≈ 1.4% (i.e. 13/953), 
which gives SL ≈ 98.6%. 
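The arithmetic behind this estimate can be made explicit. The Python sketch below assumes that SL is computed as 100 × (1 − error expressions / observed expressions); this form is consistent with the range given in footnote 12 (100 for a perfect result, negative when there are more error expressions than observed expressions), but the function name sl and its signature are ours, and the authoritative definition is the one given earlier in the paper.

```python
def sl(error_expressions: int, observed_expressions: int) -> float:
    """Assumed form of SL: 100 for no error expressions, unbounded below."""
    return 100.0 * (1.0 - error_expressions / observed_expressions)


# Figures from the test corpus: 13 counter-example error expressions
# out of 953 observed expressions.
ee_percent = 100.0 * 13 / 953
sl_test = sl(13, 953)

print(f"EE = {ee_percent:.1f}%")
print(f"SL = {sl_test:.1f}%")
```

With these figures the ratio 13/953 is about 0.0136, i.e. roughly 1.4 per cent, and the corresponding SL is about 98.6.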

It is also worth noting that in the Smorph output for the test corpus there are 464 potentially 
ambiguous inflected verbal forms (ov) for 673 actual inflected verbal chunks (69%), and 38 potentially 
ambiguous infinitive verbal forms (oin) for 284 actual infinitive verbal chunks (13.4%). This is 
important because the challenge for disambiguation arises from ov and oin forms. In the test corpus, the 
ratio of ov tokens to vnfl forms (i.e. 464/673) is practically the inverse of the ratio of nv to fl 
forms in the lexicon (i.e. 429/659) (see section 4). 

The error expressions which are true heuristic counter-examples can thus be related to the challenge 
forms, i.e. 9/464 for ov forms and 1/38 for oin forms, which gives ≈ 1.9% and ≈ 2.6% respectively. 
We thus claim that the encouraging SL obtained does not come from the challenging forms being 
weakly represented in the test corpus. 
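The ratios quoted for the challenge forms can be reproduced as follows. This Python sketch takes the 38 potentially ambiguous infinitive forms reported for the test corpus as the denominator for oin forms, which is the value that yields the quoted 2.6%; the helper pct is a hypothetical name introduced only for this illustration.

```python
def pct(part: int, whole: int) -> float:
    """Ratio expressed as a percentage, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Share of potentially ambiguous forms among actual chunks (test corpus).
ov_share = pct(464, 673)   # inflected: ov forms over inflected chunks
oin_share = pct(38, 284)   # infinitive: oin forms over infinitive chunks

# Heuristic counter-examples relative to the challenge forms.
ov_error_rate = pct(9, 464)
oin_error_rate = pct(1, 38)

print(ov_share, oin_share, ov_error_rate, oin_error_rate)
```

The error rates on the ambiguous forms stay below 3 per cent even though these forms make up more than two thirds of the inflected chunks, which is the point made in the paragraph above.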

The verbal chunks obtained by the structural rules (1) to (3) are tagged vnfl-I and vninf-I: this 
means that these chunks are defined structurally as verbal chunks. The verbal chunks obtained by the 
rules (4) are tagged vnfl-II and vninf-II: these chunks are heuristically defined as verbal chunks. 
The glass box evaluation allows us to discriminate between the two sorts of tags, and it is interesting to 
note that there is no significant difference between the precision obtained for the two tags related to 
vninf. Table 4 reproduces the results obtained for these tags on the test corpus. 

13 The example in footnote 12. 

7 Perspectives 

Not all finite or infinitive verbal chunks can be obtained by our system. However, at this stage 
it already reaches a good level in the disambiguation task. Ongoing work deals with evaluating the 
cost, in terms of lexical resources and linguistic rules, of improving the results of the 
extraction of verbal chunks. It may be summarized by the following question: what kind and what 
quantity of new information does one need to significantly improve the results? 
Besides this, we also plan to: 

• extend the system and develop our set of linguistic rules to perform the extraction of all verbal 
forms, including gerund and present participle chunks, 

• further describe the potential incidental clauses within verbal chunks, 

• extend the coverage of the system to the coordination of finite and infinitive verbs, as well as 
of participial forms, in French, 

• take advantage of collateral benefits of the system: if verbal chunks are correctly specified, we 
may incidentally deduce that occurrences of ambiguous forms outside verbal chunks must not be 
associated with the function or the category with which they are associated inside verbal chunks. E.g., 
all lui forms not in verbal chunks and not following a preposition must be associated with a nominative 
role 14. 

Future work will also consist in testing the system on corpora from various sources and of different 
types, the goal being to isolate different sets of rules with respect to their precision and effectiveness in 
terms of recall. The resulting system could then become a modular one, which users might run with 
higher precision/lower recall or lower precision/higher recall, depending on their needs. 

We are also interested in the applicative dimension of such a system, and one application 
domain in which high precision is needed is teaching. In the field of language acquisition, 
our system may be used to set up a large variety of exercises using local contexts. The rule-based system, 
provided it is reliable, and the evaluation system can also be used to automatically detect the 
errors made by students in the exercises. By integrating this system into dedicated software, we are 
pursuing two goals: one is to link developments in Natural Language Processing with the issue of teaching 
French grammar to foreign students, and the other is to bring together the different concerns of NLP and 
teaching methods. 